Abstract: We characterize the statistical bootstrap for the estimation of information-theoretic quantities from data, with particular reference to its use in the study of large-scale social phenomena. Our methods allow one to preserve, approximately, the underlying axiomatic relationships of information theory (in particular, consistency under arbitrary coarse-graining) that motivate use of these quantities in the first place, while providing reliability comparable to the state of the art for Bayesian estimators. We show how information-theoretic quantities allow for rigorous empirical study of the decision-making capacities of rational agents, and the time-asymmetric flows of information in distributed systems. We provide illustrative examples by reference to ongoing collaborative work on the semantic structure of the British Criminal Court system and the conflict dynamics of the contemporary Afghanistan insurgency.

Keywords: biological systems; cognition; social systems; information theory; statistical estimation; bootstrap; Bayesian estimation



A major function of many biological and social systems is to encode, process, and share information. The functional forms of the information-theoretic quantities used to describe these aspects of a system are given to us by deduction from a remarkably small set of axioms.

Estimation of these quantities is not trivial. When done carelessly, it can violate these underlying axioms, introduce spurious signals, lead to sensitive dependence on what should be innocuous choices of data representation, and create inconsistencies between estimation methods that otherwise should have been equivalent.

This paper addresses the utility of information-theoretic concepts, and characterizes a useful set of methods for estimating them well. In particular, we first present a method, the statistical bootstrap, for estimating some of the most important information-theoretic quantities. The method preserves, approximately, the relevant axioms.

We shall show in particular that it outperforms both "naive" and Bayesian estimators in a regime of particular interest: when n, the number of samples, is at least as large as k, the number of bins, event-types, or categories. Not all empirical work satisfies this constraint, and much effort has been devoted to the under-sampled regime or to continuous data; however, a great many problems do, and these are the ones we are concerned with here. We shall show also that the bootstrap can provide reliable error estimates.

At the same time, we introduce, for the benefit of those working in the empirical sciences who may be less familiar with the utility and power of information theory, the axioms themselves and their direct utility in producing consistent and coherent accounts of the role that information, signaling, and prediction play in the real world. We do so by reference to two real-world examples, so as to provide an explicit guide for how information theory allows the phrasing, and answering, of vital questions.

We begin, in Sec. 1, with the entropy estimation problem. This section introduces the major technical themes of the paper: coarse-graining, the axiomatic foundations of information theory, and the use of the statistical bootstrap.

We then consider two uses of information theory in the real world. The first, considered in Sec. 2, is to measure, in a principled fashion, how much two distributions differ. We describe and interpret two measures, the well-known Kullback-Leibler divergence and the less-well-known, but often better-behaved, Jensen-Shannon divergence. We then show how quantifying the differences between distributions allows us to bound the probability of error made by participants in the system. We provide an illustrative example by reference to an ongoing research project in the information-theoretic structure of the British Criminal Court system.

The second, considered in Sec. 3, is to measure the extent to which two patterns of behavior are synchronized. We emphasize the advantage of mutual information over less-principled measures such as the Pearson cross-correlation coefficient, with particular reference to the data processing inequality. We provide an illustrative example by reference to an ongoing research project in the nature of decision-making in the Afghanistan insurgency.

All of the results presented rely on the use of the statistical bootstrap. In Sec. 4 we detail numerical results on the use of this technique. We show how well the bootstrap corrects for bias; how well it preserves the relevant axioms (and their consequences); and how reliable its error estimates are. Our goal in this section is to provide an accurate guide for practitioners in the use of the bootstrap and to ground the explanations and accounts of the previous two sections. We conclude in Sec. 5.

1. Estimating Entropy

Information theory deals with probability distributions, p, over outcomes. Its most fundamental quantity is entropy, which can be interpreted as the uncertainty of outcome when drawing from p. Shannon's original paper [1] establishes that the entropy function, H(p), takes a unique form, given the assumption of continuity and two additional conditions:

1. Uncertainty principle. When all k entries of p are equal, H(p) should be a monotonic, increasing function of k.

2. Consistency under coarse-graining. H(p1, p2, p3) is equal to H(p1, p2 + p3) + (p2 + p3) H(p2, p3).

Condition 1 says that the uncertainty should rise when there are more possibilities (and all outcomes are equally likely). Condition 2 says that if one groups two outcomes, then the uncertainty is the uncertainty of the more coarse-grained description, plus the (weighted) uncertainty of outcomes from the grouped category.

Our goal in this section will be the extension of Shannon's theory to discrete counts, n, of observations such that these axioms can, in some more or less exact fashion, be preserved. Depending on what limits the axioms are satisfied in (e.g., whether they hold only on average, or for large data), we consider functions to be more or less strict. The strictest demand requires

1'. Uncertainty principle. When all k entries of n are equal, H(n) should be a monotonic, increasing function of k.

2'. Consistency under coarse-graining. H(n1, n2, n3) is equal to H(n1, n2 + n3) + ((n2 + n3)/n) H(n2, n3), where n is the total number of observations.

3'. Asymptotic convergence. As n goes to infinity, H(n) → H(p).

The simplest solution to this problem is to use the so-called naive estimator of the entropy,

$$H(n) = H\left(\left\{\frac{n_1}{n}, \ldots, \frac{n_k}{n}\right\}\right) = H(\hat{p}(n)), \tag{1}$$

where p̂(n) is often called the empirical distribution. This satisfies Conditions 1' and 2' by construction, while satisfaction of Condition 3' is a consequence of the Asymptotic Equipartition Property. By a slight abuse of notation, we will write H(n) in place of H(p̂(n)).
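The naive estimator and the coarse-graining condition can be illustrated with a minimal sketch. The counts are hypothetical, and the function names are ours rather than the paper's; the subspace entropy is computed on the renormalized counts {n2, n3}.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def naive_entropy(counts):
    """Naive estimator (Eq. 1): entropy of the empirical distribution."""
    n = sum(counts)
    return entropy([c / n for c in counts])

counts = [5, 3, 2]                       # hypothetical counts n1, n2, n3
n = sum(counts)
fine = naive_entropy(counts)
coarse = naive_entropy([counts[0], counts[1] + counts[2]])
within = naive_entropy(counts[1:])       # renormalized {n2, n3} subspace
weight = (counts[1] + counts[2]) / n

# Condition 2': fine-grained entropy equals the coarse-grained entropy
# plus the weighted entropy of the grouped category.
assert abs(fine - (coarse + weight * within)) < 1e-12
```

For the naive estimator the relation holds exactly (up to floating-point rounding), which is what "by construction" means here.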

As proven in Ref. [2], any estimator of entropy is necessarily biased, meaning that estimates of H made on a finite sample will, on average, disagree with the asymptotic value. It is well known, for example, that the naive estimator above tends to underestimate the entropy of a system, and can be quite biased indeed for small n (the first order correction), or small p_i ≲ 1/k (the second order correction).

In an attempt to reduce the bias on the naive estimator, we can attempt a bootstrap correction,

$$H_{\mathrm{corr}}(n) = H(n) - \left[ \langle H(n^*) \rangle_{P(n^*|n)} - H(n) \right], \tag{2}$$

where n* is constrained to sum to n; here P(n*|n) is the probability of drawing a set of Σ_i n_i observations, n*, from an empirical distribution given by n. Eq. 2 estimates the bias of H(n) compared to H(p) by estimating the bias of random draws from p̂, the maximum-likelihood estimator of p, compared to H(n). If the relevant properties of p are captured by p̂, this should be a reasonable approximation.
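A minimal sketch of Eq. 2 follows, using only the Python standard library. The counts, the number of resamples, and the function names are illustrative assumptions. The error range shown is the standard "basic" bootstrap interval, chosen here for simplicity; it is not necessarily the procedure used in the paper's Appendix.

```python
import math
import random

def naive_entropy(counts):
    """Naive estimator (Eq. 1), in bits, from a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def resample(counts, rng):
    """Draw n* from the empirical distribution: n draws with replacement."""
    k = len(counts)
    new_counts = [0] * k
    for i in rng.choices(range(k), weights=counts, k=sum(counts)):
        new_counts[i] += 1
    return new_counts

def bootstrap_entropy(counts, n_boot=2000, seed=0):
    """Return (H_corr, (low, high)), with H_corr as in Eq. 2."""
    rng = random.Random(seed)
    h = naive_entropy(counts)
    reps = sorted(naive_entropy(resample(counts, rng)) for _ in range(n_boot))
    h_corr = h - (sum(reps) / n_boot - h)            # Eq. 2
    # Basic bootstrap interval at roughly 95% coverage (an assumption,
    # not taken from the paper): reflect the replicate quantiles about h.
    q_lo = reps[int(0.025 * n_boot)]
    q_hi = reps[int(0.975 * n_boot) - 1]
    return h_corr, (2 * h - q_hi, 2 * h - q_lo)

counts = [50, 30, 20]                                 # hypothetical counts
h_corr, (low, high) = bootstrap_entropy(counts)
print(f"naive={naive_entropy(counts):.4f} corrected={h_corr:.4f} "
      f"range=({low:.4f}, {high:.4f})")
```

Because the resampled entropies are, on average, lower than H(n), the correction pushes the estimate upward, in the direction opposite to the naive estimator's bias.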

Fig. 5 in the Appendix illustrates explicitly the logic of the bootstrap, the implementation of Eq. 2, and the means by which one can obtain not only a bootstrap-corrected estimate, but also error ranges for that estimate.

The bias correction of Eq. 2 also violates the consistency conditions that should relate H_corr computed on fine- vs. coarse-grained distributions (Condition 2'). For example, we may expand the ⟨H(n*)⟩ term of Eq. 2 as

$$\langle H(n^*) \rangle_{P(n^*|n)} = \left\langle H(a^*) + \frac{n_2^* + n_3^*}{n} H(b^*) \right\rangle_{P(n^*|n)}, \tag{3}$$

where a* is {n1*, n2* + n3*}, and b* is {n2*, n3*}. Conversely, if we had computed H_corr in two steps, on the coarse-grained n and then the fine-grained subspace, the expansion of the equivalent terms would have given us

$$\langle H(n^*) \rangle_{P(n^*|n)} = \langle H(a^*) \rangle_{P(a^*|\{n_1, n_2+n_3\})} + \frac{n_2 + n_3}{n} \langle H(b^*) \rangle_{P(b^*|\{n_2, n_3\})}. \tag{4}$$

Eqs. 3 and 4 are identical except for the fact that the second expectation value in Eq. 4 fixes the prefactor to the maximum-likelihood value, while the expectation value in Eq. 3 places no such constraint. Any coarse-graining enforced post hoc will enforce a condition on the subset that is not enforced, by the bootstrap, in the aggregate. Consistency violations, however, are slight. Because of the approximate satisfaction of Condition 2' by the bootstrap estimator, we can recover, approximately, nearly all of the structure of information theory in the finite-data limit. As we shall show in detail in Sec. 4, the RMS error, even for n equal to is roughly one-hundredth of a bit.
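The size of the consistency violation can be probed numerically. The sketch below compares the one-step bootstrap-corrected entropy with a two-step value (coarse-grained correction plus the weighted correction in the subspace). The two-step construction is our reading of the text, and the counts are hypothetical; the result also carries Monte Carlo noise from the finite number of resamples.

```python
import math
import random

def naive_entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def corrected_entropy(counts, rng, n_boot=2000):
    """Bootstrap-corrected entropy, 2 H(n) - <H(n*)>, as in Eq. 2."""
    n, k = sum(counts), len(counts)
    total = 0.0
    for _ in range(n_boot):
        star = [0] * k
        for i in rng.choices(range(k), weights=counts, k=n):
            star[i] += 1
        total += naive_entropy(star)
    return 2 * naive_entropy(counts) - total / n_boot

def consistency_gap(counts, seed=0, n_boot=2000):
    """One-step H_corr minus the two-step (coarse, then subspace) value."""
    n1, n2, n3 = counts
    n = n1 + n2 + n3
    rng = random.Random(seed)
    one_step = corrected_entropy([n1, n2, n3], rng, n_boot)
    two_step = (corrected_entropy([n1, n2 + n3], rng, n_boot)
                + (n2 + n3) / n * corrected_entropy([n2, n3], rng, n_boot))
    return one_step - two_step

print(f"gap = {consistency_gap([50, 30, 20]):+.4f} bits")  # hypothetical counts
```

For the naive estimator the same gap is exactly zero; any non-zero value here reflects the bootstrap correction (and resampling noise).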

We turn now to the properties of information-theoretic quantities for the study of two central problems: distinguishability (Sec. 2) and synchronization (or signalling; Sec. 3.1).

2. Distances Between Distributions

In the year 1820, John Long was brought before a judge at the Old Bailey (the central criminal court of London, England) on a charge of breaking the peace. The full transcript of the trial was reported in the court proceedings of September 18th, 1820.^1 Despite the seriousness of the charge (in the prior twenty years, seven in ten guilty verdicts for breaking the peace resulted in a death sentence) the full transcript for Long's trial is just under 400 words.

^1 http://www.oldbaileyonline.org/browse.jsp?id=tl8200918-112&div=tl8200918-112; see Appendix, Sec. 6.

As part of an ongoing collaborative research program on the nature of institutional decision-making [3], we would like to know how much information the transcript contains about the outcome. Such a question is naturally phrased in terms of a distance or divergence: how "far apart" are trials, for example, that lead to guilty vs. not-guilty verdicts?

A contemporary of Long's, hearing of his being brought to trial, would have had quite a bit of information about what was likely to happen. She might know, as we do now, for example, that roughly three-quarters of all defendants at the Old Bailey the year before were given guilty verdicts (but only one in seven, when restricted to those indicted for breaking the peace). She might know more: that men were slightly more likely to be found guilty than women, for example (a difference of three percentage points in conviction rates).

Once the trial had begun, an observer would expect to refine her beliefs about what would happen. Informally, we say that the transcript carries information about the outcome, and a good observer would be sensitive to it.

The transcript carries information, of course, about a great many things, including the legal and moral intuitions of the participants, their relative social status, and more or less reliable information about the actual events that the defendant is accused of taking part in. What we are interested in is the extent to which this talk, truthful or not, provides any signal of the underlying decision-making in the legal system itself. This question is independent of causal mechanisms: whether the transcript records an input to the decision-making process, or whether it is simply a symptom of hidden variables whose values are set by other means.

Answering this question will tell us a great deal: not only about the extent to which the goings-on in the courtroom reflect actual features of the decision-making process, but also about the amount of information in principle available to actual observers, including participants who might alter, for strategic purposes, the information content and capacity of their behavior, or who might respond to information contained in the behavior of others.

To answer this question, we measure, using the data to hand, two distributions over transcript features. One distribution is constructed from trials with guilty outcomes, and one from those with not-guilty outcomes. As John Long's trial proceeds, we build up an empirical distribution over categories. If the transcript features we have identified are indeed information-bearing, in doing so we will learn something about which of the two distributions the trial is more likely to have been drawn from.

We focus on the lexical structure of the transcripts. We first measure how many times different words appear, dropping all information about the ordering of those words within a transcript. We draw on computational linguistic tools to split words by part of speech: for example, we distinguish whether 'dog' is being used as a noun or verb.

We then map these word counts to a more coarse-grained set of (one hundred and sixteen) semantic categories, and use this to build up an empirical distribution for the trial at hand. This amounts to an assumption of feature-independence: the claim that, while many aspects of language are clearly order-dependent (syntax, for example, or the turns in which people speak), we shall consider only features whose arrival order does not itself carry information.^2

^2 This is not exactly true, since we use part-of-speech information for our semantic classifier, and the means (a hidden Markov model) for identifying a word's lexical category are sensitive to order and context: compare "the dogs bit the sailor" ['dogs' as plural noun] and "the sailor dogs the waitress" ['dogs' as verb]. It is an interesting and open question the extent to which syntactical features convey semantic information.

Explicitly, then, we can define two (empirical) distributions. For trials that have guilty-verdict outcomes, p is given by

$$p_k = \frac{1}{n} \sum_{i=1}^{N_g} n_{ik}, \tag{5}$$

where n_ik is the number of counts of words in semantic category k found for (guilty-verdict) trial i, N_g is the total number of guilty trials, and n is the total number of semantic hits (i.e., the sum over all n_ik). We define q, for trials with not-guilty outcomes, similarly.

These approximations are sufficient to turn a qualitative question, about the availability of information in trial transcripts, into a form amenable to an information-theoretic analysis.
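The pooling step of Eq. 5 amounts to summing per-trial category counts and normalizing by the total number of semantic hits. A minimal sketch, with hypothetical counts and a function name of our choosing:

```python
def pooled_distribution(trials, n_categories):
    """Empirical distribution over categories, pooled across trials.

    trials: list of per-trial count vectors, trials[i][k] = n_ik.
    Returns p with p[k] = (sum over i of n_ik) / n, where n is the
    total number of semantic hits across all trials.
    """
    totals = [sum(trial[k] for trial in trials) for k in range(n_categories)]
    n = sum(totals)
    return [c / n for c in totals]

# Two hypothetical guilty-verdict trials over three semantic categories.
guilty_trials = [[1, 0, 1], [2, 1, 0]]
p = pooled_distribution(guilty_trials, 3)
```

The same function applied to the not-guilty trials gives q. Note that pooling weights each trial by its length, so long transcripts contribute more to p than short ones.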

We consider two distinct ways to answer the transcript information question. Both are formulated in terms of a distance, difference or "divergence" between the distribution over categories for guilty-outcome transcripts vs. the distribution over categories for the not-guilty outcomes.

Both methods seem to quantify essential aspects of the question. The first we consider, the Kullback-Leibler divergence, is well known, but potentially unsuited to empirical work. The second, the Jensen-Shannon divergence, is sufficiently well-behaved that it allows for bootstrap bias-correction and error estimation.

In the final subsection, we show how the Jensen-Shannon divergence can be interpreted in terms of how well an optimal decision-maker can perform in gaining knowledge about the system in question, and present the Bhattacharyya bound, which provides a strong bound on how well a rational observer can perform as more data comes in.

2.1. Kullback-Leibler Divergence

The first, Kullback-Leibler divergence, is an asymmetric measure. It can be interpreted as the answer to the following question: if the true underlying distribution is p, what is the asymptotic rate at which evidence accumulates against the alternative q?

We can derive the Kullback-Leibler divergence explicitly. We first write the probability of seeing a particular empirical distribution n given p as

$$P(n|p) = N \prod_{i=1}^{k} p_i^{n_i}, \tag{6}$$

and similarly for the distribution q; N is a combinatoric constant common to both P. We can then write the asymptotic, geometric average of the ratio as

$$\Lambda = \lim_{n\to\infty} \left( \frac{P(n|p)}{P(n|q)} \right)^{1/n} = \exp\left( \lim_{n\to\infty} \frac{1}{n} \log \frac{P(n|p)}{P(n|q)} \right). \tag{7}$$

Put picturesquely for the case of the Old Bailey, imagine a courtroom observer who has already heard an enormous amount of a trial, and that this trial is drawing from the distribution p. Each new observation will tend to confirm her belief that the trial is, indeed, drawing from p (and not q), and will change her estimate of the relative probability of outcome, P(p) / P(q), by a factor Λ.

The Kullback-Leibler (KL) divergence, D(p, q), is then defined as the average value (in p) of the logarithm of Λ: ⟨log2 Λ⟩_p. It can be written succinctly as

$$D(p, q) = \sum_{i=1}^{k} p_i \log_2 \frac{p_i}{q_i}. \tag{8}$$

If there is a non-zero probability in p of hearing a term that is impossible (zero probability) in q, a "magic word" that can only be produced by p, then there is a non-zero chance that the next word the observer will hear will make P(n|q) zero and so the Kullback-Leibler divergence will be infinite. Furthermore, since there is a non-zero probability for a resampling from p and q to create such a magic word, it is impossible to use bootstrap methods to correct for potential bias or to estimate error bars.^3

These infinities are particularly troublesome in empirical work where sparse sampling can generate magic words that do not reflect truly deterministic signals in the underlying process. In the next section, we investigate an alternative method for estimating the differences between distributions, the Jensen-Shannon divergence, that is better behaved.
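The magic-word problem is easy to see in a direct implementation of Eq. 8. The sketch below is illustrative only; the distributions are hypothetical and the function name is ours.

```python
import math

def kl_divergence(p, q):
    """D(p, q) in bits (Eq. 8); infinite if p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue                # 0 log 0 is taken to be 0
        if qi == 0:
            return math.inf         # a "magic word": possible under p only
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.25, 0.25]               # hypothetical distributions
q = [0.5, 0.5, 0.0]                 # third category never occurs under q

print(kl_divergence(q, p))          # finite: q has no magic words relative to p
print(kl_divergence(p, q))          # inf: category 3 is a magic word for p
```

The example also shows the asymmetry of the measure: swapping the arguments changes the answer, here from a finite value to an infinite one.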

2.2. Jensen-Shannon Divergence

In contrast to the KL divergence, the Jensen-Shannon (JS) divergence is symmetric. It can be interpreted as the answer to the following question. Assume that a sample is drawn from either p or q; p is chosen with probability α. How much is my uncertainty about which of the two distributions was used reduced by this single draw?

Again, put picturesquely, imagine our observer walks into a trial at the Old Bailey at random. Given the disposition of the judico-social institution as a whole, she has some (let us say correct) belief about the probabilities of outcomes. These leave her more or less uncertain about the fate of this particular defendant. Now she hears a single word spoken in the room. How much is her uncertainty (i.e., the entropy of the two-category choice guilty vs. not-guilty) reduced? The answer is the Jensen-Shannon divergence, J_α(p, q).

More formally, the JS divergence is the mutual information between a draw from one of the two distributions, and Z, a binary variable indicating which of the two distributions the process actually chose to draw from. When logarithms are base-two, this means that the Jensen-Shannon divergence is always between zero and unity (one bit).

The JS divergence has the functional form [4]

$$J_\alpha(p, q) = \alpha D(p, m) + \beta D(q, m), \tag{9}$$

where m is defined as

$$m = \alpha p + \beta q, \tag{10}$$

with α between zero and unity and β equal to 1 − α. An equivalent definition, from which some identities become easier to derive, is

$$J_\alpha(p, q) = H(\alpha p + \beta q) - \alpha H(p) - \beta H(q). \tag{11}$$

While less well-known than the Kullback-Leibler divergence, the Jensen-Shannon divergence is always finite, and so it is possible to attempt bootstrap bias correction and error estimation. Its square root also satisfies the triangle inequality (Ref. [5] and references therein) and so can even function as a metric.
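The equivalence of Eqs. 9 and 11, and the finiteness of the JS divergence even when magic words are present, can be checked with a short sketch. The distributions are hypothetical and the function names are ours.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    # Only used with q equal to the mixture m, which is non-zero
    # wherever p is non-zero (for 0 < alpha < 1), so no infinities arise.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, alpha=0.5):
    """J_alpha(p, q) in bits, using the entropy form (Eq. 11)."""
    beta = 1 - alpha
    m = [alpha * pi + beta * qi for pi, qi in zip(p, q)]
    return entropy(m) - alpha * entropy(p) - beta * entropy(q)

def js_via_kl(p, q, alpha=0.5):
    """The same quantity, using the KL form (Eqs. 9 and 10)."""
    beta = 1 - alpha
    m = [alpha * pi + beta * qi for pi, qi in zip(p, q)]
    return alpha * kl(p, m) + beta * kl(q, m)

p = [0.5, 0.5, 0.0]     # each distribution has a magic word
q = [0.0, 0.5, 0.5]     # relative to the other
assert abs(js_divergence(p, q) - js_via_kl(p, q)) < 1e-12
```

With disjoint supports, a single draw identifies the source with certainty, and the divergence reaches its maximum of H(α, β) (one bit at α = 0.5).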

^3 It is possible to correct the magic word problem post hoc, merging categories until no magic words remain. If this rule is specified with sufficient generality, it allows for bootstrap estimation. We do not consider this approach here.

Just as for entropy, we have a coarse-graining consistency relationship for the JS divergence. In particular, for two distributions p, {p1, p2, p3}, and q, {q1, q2, q3}, we can define the probability of landing in the {2, 3} subspace, p_s, as

$$p_s = \alpha(p_2 + p_3) + \beta(q_2 + q_3). \tag{12}$$

We then have

$$J_\alpha(\{p_1, p_2, p_3\}, \{q_1, q_2, q_3\}) = J_\alpha(\{p_1, p_2 + p_3\}, \{q_1, q_2 + q_3\}) + p_s \, J_{\alpha(p_2 + p_3)/p_s}(\{p_2, p_3\}, \{q_2, q_3\}), \tag{13}$$

where we silently renormalize probabilities in the subspace. Again, the consistency relationship shows the nested structure of information-theoretic reasoning: the additional information provided by more fine-grained distinctions appears, in weighted form, in the second term.
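The consistency relationship of Eq. 13 can be verified numerically for exact distributions. In the sketch below the distributions and the value of α are hypothetical; the renormalization in the subspace, left implicit in the text, is written out.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def js(p, q, alpha):
    beta = 1 - alpha
    m = [alpha * pi + beta * qi for pi, qi in zip(p, q)]
    return entropy(m) - alpha * entropy(p) - beta * entropy(q)

def renorm(v):
    s = sum(v)
    return [x / s for x in v]

def coarse_graining_gap(p, q, alpha):
    """Left side minus right side of Eq. 13 for three-outcome p and q."""
    beta = 1 - alpha
    p_s = alpha * (p[1] + p[2]) + beta * (q[1] + q[2])     # Eq. 12
    alpha_s = alpha * (p[1] + p[2]) / p_s                   # prior inside the subspace
    coarse = js([p[0], p[1] + p[2]], [q[0], q[1] + q[2]], alpha)
    within = js(renorm(p[1:]), renorm(q[1:]), alpha_s)
    return js(p, q, alpha) - (coarse + p_s * within)

gap = coarse_graining_gap([0.5, 0.3, 0.2], [0.2, 0.2, 0.6], alpha=0.7)
assert abs(gap) < 1e-12
```

The updated prior α(p2 + p3)/p_s is the probability that the draw came from p, given that it landed in the {2, 3} subspace.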

For two empirical distributions, n and m, we can define the naive estimator using Eq. 11, and from there define the bootstrap correction,

$$J_{\mathrm{corr}}(n, m) = 2 J(n, m) - \langle J(n^*, m^*) \rangle_{P(n^*, m^*|n, m)}. \tag{14}$$

We characterize the bias correction, and error estimation, properties of Eq. 14 in Sec. 4; as for the bootstrap-corrected entropy, the bootstrap-corrected J leads to violations of the consistency relation, Eq. 13. Violations are slight, and of the same order as found for the entropy itself; Table 1 provides details.

2.3. The Bhattacharyya Bound

We now consider how information theory can place rigorous bounds on the abilities of observers both inside and external to the system. We will consider, in particular, how to measure a maximally-rational observer's rate of error (the fraction of the time we can expect her to be wrong) when inferring facts about the system. Rather than measure her error rate precisely, we will show how to bound it from above.

Bounding the rate, as opposed to knowing it directly, will not be so great a loss. In making our subject maximally-rational, we have at the same time obscured some features of the system available to real-world observers. A real participant would have access to more information and thus be able to outperform the one we can describe. Thus, regardless of how well we estimate the error rate for our fiducial subject, we will only ever have (more or less strict) upper bounds.

Consider, once more, our observer at the Old Bailey, who watches a randomly-chosen trial and is interested in determining its outcome. Let us take the probability of a guilty outcome to be α; for simplicity, assume that α is greater than 0.5. If our observer knows this, and nothing else, her best ("Bayes optimal" or, informally, rational [6,7]) guess, 'guilty', will be wrong with probability (1 − α).

When the observer acquires new information, her probability of error will decrease. By a theorem of Lin's [4], we can bound this updated probability of error. For a single observation, the probability of error is bounded from above by

$$P_e \le \frac{1}{2} \left[ H(\alpha, \beta) - J_\alpha(p, q) \right]. \tag{15}$$

In many cases, J_α(p, q) may be very small, and the bound on P_e may not differ significantly from the bound given by prior knowledge. Indeed, in many real-world situations, this is to be expected: it would be remarkable, for example, if an observer were able to glean significant information about the progress of a trial from hearing only a single word! Unfortunately, the generalization of P_e to the case of multiple observations does not have a simple formulation in terms of the Jensen-Shannon divergence.

A number of different ways exist to approximate bounds on P_e(n), the probability of error after n observations. A commonly used one is the Bhattacharyya bound,

$$P_e(n) \le \sqrt{\alpha \beta} \, \rho^n, \tag{16}$$

where ρ is

$$\rho = \sum_{i=1}^{k} \sqrt{p_i q_i}. \tag{17}$$
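The estimator and bounds of Eqs. 14 to 17 can be sketched together, assuming the forms given above. Function names, counts, and the number of resamples are illustrative; in particular, holding α fixed across resamples and resampling the two count vectors independently are assumptions of this sketch.

```python
import math
import random

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_from_counts(n_counts, m_counts, alpha):
    """Naive JS estimate: Eq. 11 applied to the two empirical distributions."""
    p = [c / sum(n_counts) for c in n_counts]
    q = [c / sum(m_counts) for c in m_counts]
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(p, q)]
    return entropy(mix) - alpha * entropy(p) - (1 - alpha) * entropy(q)

def resample(counts, rng):
    new_counts = [0] * len(counts)
    for i in rng.choices(range(len(counts)), weights=counts, k=sum(counts)):
        new_counts[i] += 1
    return new_counts

def js_corrected(n_counts, m_counts, alpha, n_boot=1000, seed=0):
    """Bootstrap-corrected JS divergence (Eq. 14); alpha is held fixed."""
    rng = random.Random(seed)
    mean_rep = sum(
        js_from_counts(resample(n_counts, rng), resample(m_counts, rng), alpha)
        for _ in range(n_boot)
    ) / n_boot
    return 2 * js_from_counts(n_counts, m_counts, alpha) - mean_rep

def lin_bound(alpha, j):
    """Single-observation bound on the error probability (Eq. 15)."""
    return 0.5 * (entropy([alpha, 1 - alpha]) - j)

def bhattacharyya_bound(alpha, p, q, n_obs):
    """Bound on the error probability after n_obs observations (Eqs. 16-17)."""
    rho = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(alpha * (1 - alpha)) * rho ** n_obs
```

Because the naive JS estimate is never negative while the correction subtracts the average resampled excess, the corrected value can come out slightly below zero when the two samples are nearly indistinguishable; this is a property of the correction, not a bug in the sketch.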



The Bhattacharyya bound is an approximation to the stricter Chernoff bound [8]; in practice, it is very close [9,10], and far less computationally intensive to measure. Other approximations exist [11], but the functional form of the Bhattacharyya bound makes it possible to extend the bound to multiple observations.

To give an example of the use of the Bhattacharyya bound, we can consider the statistics of criminal trials in the years prior to 1820. In the twenty year period between 1800 and 1820, the probability of a guilty verdict was 76%, and so the associated error rate, before any observations are made, is 24%. The associated Bhattacharyya bound, P_e(0), is 43%: not particularly tight (i.e., not close to the true value of 24%). Indeed, given the ease with which the exact rate can be computed, it is not of great interest.

The power of the bound quickly becomes apparent, however. A standard coarse-graining we use in our investigations of this system sorts words into 116 possible categories. For the twenty year period between 1800 and 1820, the ρ term for this particular coarse-graining, splitting guilty vs. not-guilty verdicts, is approximately 0.9980 ± 0.0002: very close to, but not exactly, unity.^4 With knowledge of ρ, we can then compute the probability of error given an arbitrary number of words. The error rate drops to 5% after approximately eight hundred words, and to 1% at 1620.

The Bhattacharyya bound has an important limitation: it refers to prediction not upon sampling repeatedly from a particular instance of a class (a particular trial), but on sampling from the overall distribution. These two cases may, or may not, be equivalent. An obvious way for the equivalence to fail is for the underlying bag-of-words model to fail: if a trial's semantic features are not independent draws, but instead prior text within the trial alters the distribution from which the remainder is drawn.

A less obvious way for the assumption to fail is if there are multiple sub-classes. It may, for example, be the case that a guilty-verdict trial can take two forms: independent draws from p1, or independent draws from p2. If the two sub-classes are equally likely, then we will measure p to be the average of p1 and p2, but there is no such thing as a trial that draws from p.^5 Note that when the sub-class membership is unknown, the second failure mode also leads to conditional dependence: each draw gives better information about which sub-class has been chosen, and thus alters one's beliefs about subsequent draws.

For single draws, this does not matter; it only becomes apparent when making multiple draws. If the n draws are represented as {n_i}, and the two sub-classes are equally likely, we have

$$P(p|n) \propto \alpha \left[ \prod_{i=1}^{n} p_1(n_i) + \prod_{i=1}^{n} p_2(n_i) \right], \tag{18}$$

which is distinct from the single-class case used to derive the Bhattacharyya bound,

$$P(p|n) \propto \alpha \prod_{i=1}^{n} \left( p_1(n_i) + p_2(n_i) \right), \tag{19}$$

which contains cross-terms. The functional form of Eq. 18 makes it impossible to derive an equally simple version of Eq. 16 for the multiple-class case.

^4 We can estimate ρ itself by means of the statistical bootstrap. Resampling suggests that ρ itself may suffer from bias which can be corrected for.

^5 Consider, for example, the case where the supports of p1 and p2 only partially overlap: p1(x) is zero for some x, but p2(x) is non-zero, and vice versa for some y. The draw {x, y} is not possible for either p1 or p2, but is possible for the average.

Figure 1. Prediction error curves and the existence of multiple classes. Solid curve: the Bhattacharyya bound for prediction of trial outcome for the period 1800 to 1820. Triangle symbols and solid line: actual prediction error, when drawing samples (words) from all trials within a class (guilty or not-guilty). As expected, the curve lies strictly below the Bhattacharyya bound. Diamond symbols and dashed line: actual prediction error, when drawing samples from a single trial. The prediction error actually rises (more samples lead to a less accurate prediction), suggesting that the underlying model (trials sample from one of two distributions) is incorrect. We restrict the set of trials here to those with at least one hundred (semantically-associated) words, so as to make the resampling process more accurate. [Plot not reproduced; horizontal axis: number of words.]
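The difference between the two-sub-class and single-class likelihoods can be made concrete with the partially overlapping supports described in the footnote. The sketch below computes only the likelihood of the draws (the prior α is omitted), writes the factors of 1/2 explicitly, and uses hypothetical distributions over three symbols.

```python
import math

def lik_two_subclasses(draws, p1, p2):
    """Eq. 18 form: the whole trial is drawn from p1 or from p2 (equally likely)."""
    return (0.5 * math.prod(p1[x] for x in draws)
            + 0.5 * math.prod(p2[x] for x in draws))

def lik_single_class(draws, p1, p2):
    """Eq. 19 form: every draw comes from the averaged distribution."""
    return math.prod(0.5 * (p1[x] + p2[x]) for x in draws)

# Hypothetical sub-classes: "x" is impossible under p1, "y" under p2.
p1 = {"x": 0.0, "y": 0.5, "z": 0.5}
p2 = {"x": 0.5, "y": 0.0, "z": 0.5}

print(lik_two_subclasses(["y"], p1, p2), lik_single_class(["y"], p1, p2))
print(lik_two_subclasses(["x", "y"], p1, p2), lik_single_class(["x", "y"], p1, p2))
```

For a single draw the two forms agree. For the pair {x, y}, the single-class form assigns non-zero probability through its cross-terms, while the two-sub-class form correctly assigns zero.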

We can study the validity of the assumptions of the Bhattacharyya bound by comparing the predicted bounds on the error rate with the actual success we have on predicting the outcomes of trials in the dataset itself. Fig. 1 plots (1) the Bhattacharyya bound, (2) the error rate if each observation samples at random from the set of all trials with a particular outcome, and (3) the error rate if each observation samples only from a single, randomly chosen, trial (and predictions are made using Eq. 19).

As expected, (2) is bounded by (1), but the error rate for (3) actually rises: we do worse at predicting the verdict for a particular trial when given more of its transcript. Such an outcome strongly suggests, of course, that the assumption of a single class is wrong; that is, that there are different ways to be found guilty, and that these differences leave signatures in the semantic features of the trials themselves. It is a form of ergodicity breaking that one expects to be common in social systems: a single (guilty verdict) trial will not sample the full space of possible (guilty verdict) trial features. Correct estimation of the error rate for a real-world observer of a particular trial requires one to estimate the number of sub-classes within each verdict.

2.4. Summary

Measuring the extent to which two distributions differ is a common question in an information-theoretic context. Differences between distributions are naturally interpreted as decision problems: how well ideal observers can distinguish different outcomes. Different measures have different interpretations, and while these differences appear subtle, their properties may make them more or less useful for empirical study.

The KL divergence can be interpreted in terms of an asymptotic rate at which information for a particular hypothesis is accumulated, given that the hypothesis is true. It has the unfortunate property of becoming infinite under conditions we would expect in the real world.

Meanwhile, the JS divergence is well behaved. It can be interpreted as the amount of information (in bits) relevant to distinguishing two outcomes that is contained in a single observation of the system.

Finally, the Bhattacharyya bound extends the Jensen-Shannon divergence to the case of multiple observations. Care must be taken in its use, since a binary decision task may involve multiple sub-classes; the bound is strictly true only when considering draws from the overall distribution.

3. Correlation, Dependency and Mutual Information

On April 3rd, 2005, in Spin Boldak, a town in Kandahar province, members of the Afghanistan insurgency remotely detonated a bomb concealed in a beverage vendor cart. The resulting explosion killed two people: a civilian and a police officer.† It was one of eight events in the country recorded by members of the International Security Assistance Force (ISAF) that day. To what extent was this event coordinated with others that day, week, or year?

† The full report, as used in the analysis of this section, appears in Appendix Sec. 7.

The event described above is drawn from the Afghan War Diary (AWD) database, a remarkably detailed account of the Afghanistan conflict and the actions by both the insurgency and ISAF. The open-source nature of the release has led to a number of efforts to characterize the data [12,13]. It is likely to become a standard set for both the analysis of human conflict and the study of empirical methods for the analysis of complex, multi-modal data. The release amounts to of order 70,000 SIGACTs ("Significant Activity Reports"), which record detailed information about individual events.

After filtering the data, we distinguish between SIGACTs that record insurgent-initiated events and those that record ISAF-initiated events.‡ We then choose either of the two sets, group the events by the day on which they occurred, and generate symbolic time series for each province.

We coarse-grain the complex information available in the SIGACT data by means of a four-state codebook, where codes are assigned based on the severity of violence. In order of increasing severity, Code 0 is when no events are recorded in the province that day; Code 1 is when events are recorded, but no injuries or deaths are associated; Code 2 is when one or two injuries or deaths are recorded; Code 3 is when more than two injuries or deaths are recorded. Taking the example of Kandahar province on April 3rd, there was, in addition to the Spin Boldak bombing, one other insurgent-initiated event recorded: a second IED (improvised explosive device) explosion with no reported injuries or deaths. Based on these two facts, we assign the insurgent time stream for Kandahar province Code 2.
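The four-state codebook can be written down as a small function. The following is a minimal sketch, not the authors' code: the function name `severity_code` and the input format (one casualty count per recorded event) are hypothetical, and the sketch assumes that the code depends on the day's total of injuries plus deaths.

```python
def severity_code(casualties_per_event):
    """Map one province-day to the four-state codebook.

    casualties_per_event: one entry per recorded event, each giving the
    number of injuries plus deaths for that event (hypothetical format).
    """
    if not casualties_per_event:
        return 0  # Code 0: no events recorded that day
    total = sum(casualties_per_event)  # assumption: codes use the daily total
    if total == 0:
        return 1  # Code 1: events recorded, no injuries or deaths
    if total <= 2:
        return 2  # Code 2: one or two injuries or deaths
    return 3      # Code 3: more than two injuries or deaths


# Kandahar, April 3rd, 2005: the Spin Boldak bombing (two deaths) plus a
# second IED explosion with no reported injuries or deaths.
kandahar_code = severity_code([2, 0])
```

Applied day by day to each province, a function of this kind produces the symbolic time series used in the rest of the section.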

Central to an understanding of modern insurgencies is measuring (1) the level of communication and coordination among insurgent groups [14]; and (2) signaling of intents and abilities, both among groups and between groups, government actors and the civilian population [15,16]. This leads directly to the information-theoretic question of the extent to which this event is coordinated with other events in the system, and the extent to which the other events may or may not have played a signaling role.

As part of an ongoing collaborative investigation [17] we would like to know, or at least bound, the minimal amount of information shared between systems for the purposes of synchronization or response, and how the temporal structure of this shared information changes in time. Examining the data we have to hand within the axiomatic framework of information theory is likely to provide a novel approach to longstanding conceptual and quantitative questions at the center of the study of human conflict.

Because of its centrality to the study of decentralized insurgency, we focus in this section on the question of signaling and coordination. We focus, in particular, on two neighboring provinces, Kandahar and Helmand, in the year 2005, to show the kind of questions information theory allows us to pose, and the provocative answers it provides.

3.1. Mutual Information

Mutual information is a specialization of the divergence measures considered in the previous section. In particular, it measures the KL divergence between two distributions: a joint distribution, $p_{ij}$, and one derived from $p_{ij}$ but in which the processes are forced to be independent,

$$I(p_{ij}) = D(p_{ij}, p_{i\cdot}\,p_{\cdot j}), \tag{20}$$

where the marginals are

$$p_{i\cdot} = \sum_{j=1}^{k_j} p_{ij}, \tag{21}$$

and similarly for $p_{\cdot j}$. If the space of events labelled by $i$ is $A$, and $B$ for $j$, we often write $I(A, B)$ for Eq. 20.

‡ There are more than 78,000 SIGACTs in the original data set. We removed approximately 20,000 because they expressed a level of ambiguity about what happened ("unknown-initiated action" or "suspicious incident"), because their GPS coordinates did not match the description, because they were not time-sensitive ("weapons cache found") or because they were irrelevant to our study ("IED hoax" or "show of force"). Information on initiative is explicitly included in the SIGACT data and is extracted by the methods of Ref. [12].
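Eqs. 20 and 21 translate directly into a few lines of code. The sketch below is illustrative only: the function name `mutual_information` and the list-of-rows representation of $p_{ij}$ are assumptions of the example, and the logarithm is taken base two so that the result is in bits.

```python
from math import log2


def mutual_information(joint):
    """Mutual information, in bits, of a joint distribution p_ij (Eq. 20).

    joint: list of rows; entries are non-negative and sum to one.
    """
    row = [sum(r) for r in joint]        # marginal p_i. (Eq. 21)
    col = [sum(c) for c in zip(*joint)]  # marginal p_.j
    total = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0.0:                  # 0 log 0 is taken to be 0
                total += p * log2(p / (row[i] * col[j]))
    return total


# Two limiting cases for a pair of binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]
perfectly_correlated = [[0.5, 0.0], [0.0, 0.5]]
```

The independent table gives zero, and the perfectly correlated table gives one bit, which is the entropy of either variable.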

A standard, and useful, interpretation of mutual information is the average reduction in uncertainty (entropy) of the value of a sample from $A$, given knowledge of the value of a sample from $B$. This means, among other things, that the entropy of $A$ is an upper limit on the mutual information between $A$ and any other variable.

As with entropy and the Jensen-Shannon divergence, the naive estimator can be used to define a bootstrap-corrected version,

$$I_{\mathrm{corr}}(n) = 2I(n) - \langle I(n^*) \rangle_{p(n^*|n)}, \tag{22}$$

where coarse-graining consistency is now preserved only approximately.
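A minimal sketch of the bootstrap correction of Eq. 22 follows, under stated assumptions: the average over $p(n^*|n)$ is approximated by a finite number of multinomial resamples of the observed counts, and the names `naive_mi`, `bootstrap_corrected_mi`, `resamples` and `rng` are hypothetical. It is one way to realize the estimator, not necessarily the authors' implementation.

```python
import random
from math import log2


def naive_mi(counts):
    """Plug-in mutual information (bits) from a table of counts n_ij."""
    n = sum(sum(r) for r in counts)
    row = [sum(r) for r in counts]
    col = [sum(c) for c in zip(*counts)]
    mi = 0.0
    for i, r in enumerate(counts):
        for j, c in enumerate(r):
            if c > 0:
                mi += (c / n) * log2(c * n / (row[i] * col[j]))
    return mi


def bootstrap_corrected_mi(counts, resamples=1000, rng=None):
    """Approximate I_corr(n) = 2 I(n) - <I(n*)> by Monte Carlo resampling."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    rows, cols = len(counts), len(counts[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    weights = [counts[i][j] for i, j in cells]
    n = sum(weights)
    acc = 0.0
    for _ in range(resamples):
        star = [[0] * cols for _ in range(rows)]
        # Draw n observations with replacement from the empirical distribution.
        for i, j in rng.choices(cells, weights=weights, k=n):
            star[i][j] += 1
        acc += naive_mi(star)
    return 2.0 * naive_mi(counts) - acc / resamples


factorizable = [[10, 10], [10, 10]]
corrected = bootstrap_corrected_mi(factorizable, resamples=200)
```

For the exactly factorizable table above the naive estimate is zero, while the resampled tables generally show some correlation, so the corrected value comes out slightly negative.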

By contrast with $H_{\mathrm{corr}}$, it is possible for $I_{\mathrm{corr}}$ to be less than zero: observations that happen to produce a precisely factorizable empirical distribution, for example, so that $I(n)$ is zero, will have non-zero probability to resample to a distribution that does produce correlations. Thus, the information inequality, $I(X; Y) \geq 0$ with equality if and only if $X$ and $Y$ are independent, does not hold for $I_{\mathrm{corr}}$. This parallels the difficulty, in many empirical studies, of establishing complete conditional independence given finite data [18,19], and requires reliable estimation of error ranges in order to prevent reporting of nonexistent relationships.

Much like the quantities of the previous section, mutual information does not directly measure causation. This can be seen explicitly in the symmetric structure of Eq. 20, where $I(A, B)$ is equal to $I(B, A)$. There are a number of methods for finding (approximate) answers to causal questions [20,21]; a common starting point is to examine time-lagged mutual information: the methods we describe here, and the means by which they are characterized, are equally amenable to the case where $A$ is time-lagged relative to $B$.

The mutual information for the two provinces, given only knowledge of four-state codes, is 0.04 ± 0.02 bits (bootstrap corrected); same-day knowledge of the events gives a small, but detectable, boost to predictive ability, indicating some pathway for the sharing of information between the two processes. We emphasize that such a pathway may not be direct, and may involve common cause nodes which act not as a conduit of information, but as a synchronizing signal.

Obvious exogenous, common causes include external political factors such as a national election, and seasonal weather patterns that make it hard for the insurgency to act during harsh winters. For example, if the insurgency usually conducts daily high-severity events (codes 2 and 3) but a harsh winter makes this impossible, knowing the severity of an event in Kandahar will tell you the season (winter/not-winter, one bit), and lead to (at least) one bit of mutual information between Kandahar and Helmand provinces, without any direct causal influence.

Fig. 2 shows how mutual information can be used to determine both the timescale and directionality of information flow. We consider the mutual information between a single day in one province, and the modal day, on some date range, for the second province. Taking the modal day (i.e., the most common symbol found in that date range, with ties being broken by taking the median of the modes) is essentially a coarse-graining of the exponentially large multi-day state space; it scrambles time information within the range, and amounts (in the language of renormalization theory) to a particular decimation choice.
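The modal-day coarse-graining can be sketched as follows. The function name `modal_symbol` is hypothetical, and one detail is an assumption of this sketch: when an even number of symbols tie for the mode, the lower of the two middle modes is returned, since the arithmetic median would not in general be a valid code.

```python
from collections import Counter


def modal_symbol(window):
    """Most common symbol in a window of daily codes.

    Ties are broken by the median of the tied modes; for an even number of
    tied modes the lower median is used (an assumption of this sketch).
    """
    counts = Counter(window)
    top = max(counts.values())
    modes = sorted(s for s, c in counts.items() if c == top)
    return modes[(len(modes) - 1) // 2]


clear_mode = modal_symbol([0, 1, 1, 3])            # single mode
three_way_tie = modal_symbol([0, 0, 2, 2, 3, 3])   # modes 0, 2, 3
two_way_tie = modal_symbol([1, 1, 2, 2])           # modes 1, 2
```

Pairing each single-day code in one province with the modal symbol of a date range in the other then gives the joint counts from which the mutual information is estimated.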
Figure 2. Predictability of the Kandahar and Helmand time streams. Top: a dramatic asymmetry on short timescales strongly suggests anticipatory, and potentially causal, effects transmitted from Kandahar to Helmand province on rapid (less than two-week) timescales. Bottom: the consistent, opposite asymmetry is seen in the reverse process. A rise in the predictability of Kandahar by Helmand on longer (one-month) timescales, mirrored in the top panel, suggests potentially longer-term seasonal or constraint-based information common to both systems.

[Figure 2 plots not reproduced. Horizontal axis of both panels: Timescale (Days), with ticks from -40 to 40; vertical axis ticks run from 0.00 to 0.25. Surviving panel annotations: "Helmand past predicts Kandahar present" and "Kandahar present predicts Helmand future".]

Version February 6, 2013 submitted to Entropy 






The top panel of Fig. 2 considers the mutual information between a day in Helmand province, and the modal day of either a range of dates in the past (negative $x$-axis) or future. For example, knowledge of the modal day for the prior twenty days in Kandahar leads to approximately 0.1 bits of information about the current day in Helmand. We can turn that phrasing around for the positive $x$-axis, and note that knowledge of the current day's events in Helmand provides much less information about the modal day in Kandahar's near future.

This effect is reversed (lower panel) where we find that knowledge of Kandahar's present gives more information about Helmand's future modal days than the reverse. On longer timescales, there is some increase in the predictive power of Helmand's past for Kandahar's future ($-35$ days on the $x$-axis), which could potentially be attributed to knowledge of seasonal properties that affect both provinces; a similar predictive power is seen on the same timescales, in the same position, in the top panel, suggesting a common cause with similar effects.

3.2. The Data Processing Inequality

A novel relationship that arises for the case of mutual information is the data processing inequality. In its simplest form, as found in Ref. [8], it states that if three random variables $X$, $Y$ and $Z$ form a Markov Chain in that order, i.e.,

$$p(x, y, z) = p(z|y)\,p(y|x)\,p(x), \tag{23}$$

then

$$I(X, Y) \geq I(X, Z). \tag{24}$$

This is called the data processing inequality because the transformation from $Y$ to $Z$ can be seen as an (independent) "processing" of the output of a measurement of $Y$, which cannot (under the assumption of a perfectly rational observer) add any new information about $X$. Directly relevant to our work here is a four-term version of the inequality, where two underlying random variables, $X$ and $Y$, are known by two independent post-processings, $A$ and $B$. For us, the two mappings, $X \to A$ and $Y \to B$, are a good description of how a massively multi-dimensional random variable, describing the full state of a province, is reduced to the sum of the observed SIGACTs and, further, to the four-state codebook considered here. If the structure of the required conditional independencies is

$$p(a, x, b, y) = p(x, y)\,p(a|x)\,p(b|y), \tag{25}$$

then the mutual information between $A$ and $B$ is bounded by that between the underlying system,

$$I(X, Y) \geq I(A, B). \tag{26}$$

Eqs. 24 and 26 play a role similar to the Lin and Bhattacharyya bounds for decision-making in Sec. 2.3. If we are interested in the coordination of violence between provinces, then measurement of the mutual information between the dimensionality-reduced data provides a strict bound on the full system; phrased in the language of Sec. 2.3, it bounds the predictability, by a rational agent with at least as much information, from below.
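The four-term inequality of Eq. 26 can be checked numerically for exactly known distributions. The sketch below is illustrative: it draws an arbitrary joint distribution $p(x, y)$ and two arbitrary stochastic post-processings, builds $p(a, b)$ from the factorization of Eq. 25, and compares the two mutual informations. All function names, the state-space sizes and the random seed are assumptions of the example.

```python
import random
from math import log2


def mutual_information(joint):
    """Mutual information (bits) of a normalized joint distribution."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (row[i] * col[j]))
               for i, r in enumerate(joint)
               for j, p in enumerate(r) if p > 0.0)


def random_distribution(k, rng):
    """An arbitrary normalized distribution over k states."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]


def processed_joint(p_xy, a_given_x, b_given_y):
    """p(a, b) = sum over x, y of p(x, y) p(a|x) p(b|y), following Eq. 25."""
    n_a, n_b = len(a_given_x[0]), len(b_given_y[0])
    p_ab = [[0.0] * n_b for _ in range(n_a)]
    for x, row in enumerate(p_xy):
        for y, p in enumerate(row):
            for a in range(n_a):
                for b in range(n_b):
                    p_ab[a][b] += p * a_given_x[x][a] * b_given_y[y][b]
    return p_ab


rng = random.Random(1)
flat = random_distribution(5 * 6, rng)
p_xy = [flat[6 * x: 6 * (x + 1)] for x in range(5)]          # 5 x 6 joint
a_given_x = [random_distribution(3, rng) for _ in range(5)]  # rows are p(a|x)
b_given_y = [random_distribution(4, rng) for _ in range(6)]  # rows are p(b|y)
p_ab = processed_joint(p_xy, a_given_x, b_given_y)
i_xy = mutual_information(p_xy)
i_ab = mutual_information(p_ab)
```

Here `i_ab` never exceeds `i_xy`, whatever post-processings are drawn. Note that this holds for the exact distributions; estimates from finite samples need not respect the inequality.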
A simpler form of the data processing inequality can be found for the case where the post-processing is deterministic, i.e., the entries of $p(a|x)$ and $p(b|y)$ are only either zero or one. This version of the data processing inequality amounts to the statement that coarse-graining a process will not, on average, increase the mutual information with a second data stream. This extends to any combination of deterministic coarse-graining and remapping (i.e., uniform reclassification of one of the event spaces). For a 2 x 3 joint distribution, the associated coarse-graining relationship is

$$I\begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \end{pmatrix} = I\begin{pmatrix} p_{11} & p_{12} + p_{13} \\ p_{21} & p_{22} + p_{23} \end{pmatrix} + \left(1 - (p_{11} + p_{21})\right) I\!\left(\frac{1}{1 - (p_{11} + p_{21})}\begin{pmatrix} p_{12} & p_{13} \\ p_{22} & p_{23} \end{pmatrix}\right). \tag{27}$$

We study the preservation of this axiomatic relationship in detail in Sec. 4.2.
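For an exactly known distribution, the coarse-graining relationship of Eq. 27 holds as an identity, which the following sketch verifies for one arbitrary 2 x 3 table. The numerical values of `p` are made up for illustration, and the inner 2 x 2 table is normalized by its total weight before its mutual information is computed, which is an assumption about how the last term of Eq. 27 is to be read.

```python
from math import log2


def mutual_information(joint):
    """Mutual information (bits) of a normalized joint distribution."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (row[i] * col[j]))
               for i, r in enumerate(joint)
               for j, p in enumerate(r) if p > 0.0)


# An arbitrary 2 x 3 joint distribution (illustrative values, sums to one).
p = [[0.10, 0.25, 0.05],
     [0.30, 0.10, 0.20]]

# Merge the second and third columns.
coarse = [[p[0][0], p[0][1] + p[0][2]],
          [p[1][0], p[1][1] + p[1][2]]]

# Weight of the merged column, and the distribution conditional on it.
weight = 1.0 - (p[0][0] + p[1][0])
within = [[p[0][1] / weight, p[0][2] / weight],
          [p[1][1] / weight, p[1][2] / weight]]

lhs = mutual_information(p)
rhs = mutual_information(coarse) + weight * mutual_information(within)
```

The two sides agree to floating-point precision. The question studied in Sec. 4.2 is how closely the same relationship survives when each term is replaced by an estimate from finite data.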



4. The Bootstrap Estimators In Practice

The previous two sections have introduced two applications of information theory to the study of large-scale collective behavior. The framework allows us to quantify conceptually important aspects of both systems. Since estimation of these quantities is often from noisy, sample-limited data, we would like to know how our estimators perform in practice. We consider here the performance of the bootstrap estimators in the preservation of coarse-graining consistency, and in the error estimates and bias correction they provide. This section provides the main technical, as opposed to empirical or conceptual, results of our paper.



4.1. The Bayesian Prior Hierarchy

In order to characterize the bootstrap, we must first consider the range of problems we hope to apply it to. Doing this amounts to defining an "underlying" distribution over distributions. For the discrete-symbol processes we have considered above, a mathematically elegant choice for the underlying distribution involves the use of the Dirichlet distribution. For the homogeneous case, the Dirichlet distribution is parameterized by a single parameter, $\beta$, where the probability of any particular distribution, $p$, arising is

$$P(p|D_\beta) = \frac{1}{Z(\beta)} \prod_{i=1}^{k} p_i^{\beta - 1}, \tag{28}$$

where $Z(\beta)$ is a normalization constant. While $D_1$, sometimes called the "Laplace prior", is a common choice, it was noted by Ref. [22] that it has unusual information-theoretic properties; in particular, the average value of the entropy of a distribution drawn from $D_1$ is quite high. Ref. [22] suggested an interesting alternative: to construct a mixture of Dirichlet distributions, $D_{\mathrm{NSB}}$, such that the entropy of a distribution drawn from $D_{\mathrm{NSB}}$ is approximately uniform.

The question of Bayesian estimators for information-theoretic quantities under a Dirichlet prior was addressed by Ref. [23] (WW), where tools were provided for estimation of entropy and mutual information for arbitrary $\beta$, and explicit formulas for the $D_1$ case were given; these are analytic in terms of polygamma functions. Ref. [22] provided a numerical method for the estimation of entropy under the $D_{\mathrm{NSB}}$ priors. In order to produce an NSB estimator for the mutual information, we can extend Theorem 10 of Ref. [23] to the $\beta \neq 1$ case, and then integrate this over the NSB prior (Eq. 9 of Ref. [22]); this estimates mutual information under the assumption that draws of the $p_{ij}$ are (approximately) uniform in entropy.

In the Bayesian framework, more general choices of prior allow for the proper evaluation of a wider range of models; conversely, Bayesian estimators will often fail when given a sample whose underlying model lies outside the prior support. The $D_{\mathrm{NSB}}$ prior is more general than the $D_1$ prior, and as such has a wider range of applicability; we expect the NSB estimators to strongly outperform the WW estimators when evaluated on distributions drawn from Dirichlet distributions with $\beta$ much less than unity, for example.

Both priors, however, assume that bins are drawn from distributions with homogeneous weights (i.e., $\beta$ a constant independent of $i$ in Eq. 28). This assumption is likely to fail in real world systems, and its failure may lead to inaccurate inferences. This is particularly problematic in biological and social systems where there are few clues to the correct choice of binning: if a three state system is best modeled as a draw from $D_{\mathrm{NSB}}$, then this assumption will fail for a different observer, who, under the influence of a rival theory, gathers data in such a way as to group two of the bins together to get a two-state system that now draws from an inhomogeneous distribution.

While a single Dirichlet distribution with inhomogeneous weights leads to asymmetries that may be hard to justify a priori, this symmetry may be restored in Dirichlet mixtures. We thus consider in this paper a novel mixture, $D'$, that allows a distribution to be drawn from a range of inhomogeneous Dirichlet distributions. A draw of a distribution $p$ with $k$ bins from $D'$ is made as follows:

1. Draw a random integer, $k'$, between $k$ and $k^2$ inclusive.

2. Draw a distribution $p'$ with $k'$ bins from $D_{\mathrm{NSB}}$. Then $p'$ is approximately uniform in entropy over $k'$ bins.

3. Randomly partition the $k'$ bins into $k$ bins.

4. Coarse grain the distribution $p'$, given the partition of Step 3, to get $p$.

This construction always amounts to a draw from some Dirichlet distribution (since a coarse-graining of a Dirichlet distribution is itself a Dirichlet distribution with different parameters). Our use of random partitions restores bin symmetries and ensures that we are not placing unwarranted a priori structures on the average properties of draws. Random partitioning may be done rapidly by modification of the ranksb and rancom algorithms from Ref. [24].

Note that just as $D_1$ is strictly contained within $D_{\mathrm{NSB}}$, $D_{\mathrm{NSB}}$ is strictly contained within $D'$ (in particular, $D_{\mathrm{NSB}}$ is selected when $k'$ is drawn equal to $k$). This gives us a hierarchy of Bayesian priors,

$$D_1 \subset D_{\mathrm{NSB}} \subset D', \tag{29}$$

where the set containment here is interpreted in terms of the support of the distributions.
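The four-step construction of a draw from $D'$ can be sketched as below, with several loudly stated assumptions. The sampler for Step 2 is passed in as `base_draw`, and the default used here is a homogeneous Dirichlet with $\beta = 1$, a simple stand-in that is not the NSB mixture. The upper limit on $k'$ is exposed as `k_max` and defaults to $k^2$, which is an assumption of this sketch. The random partition of Step 3 is done by shuffling the bins and cutting the shuffled list into $k$ non-empty blocks, one simple choice that is not claimed to match the modified ranksb and rancom algorithms of Ref. [24]. All names are hypothetical.

```python
import random


def draw_homogeneous_dirichlet(k, beta, rng):
    """Draw from a homogeneous Dirichlet by normalizing gamma variates."""
    g = [rng.gammavariate(beta, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]


def draw_from_d_prime(k, rng, base_draw=None, k_max=None):
    """Sketch of Steps 1 to 4 for drawing a k-bin distribution."""
    if k_max is None:
        k_max = k * k  # assumed upper limit on k'
    if base_draw is None:
        # Stand-in for Step 2: Dirichlet with beta = 1, NOT the NSB mixture.
        base_draw = lambda kp: draw_homogeneous_dirichlet(kp, 1.0, rng)
    k_prime = rng.randint(k, k_max)             # Step 1 (inclusive range)
    p_prime = base_draw(k_prime)                # Step 2
    order = list(range(k_prime))                # Step 3: shuffle the bins,
    rng.shuffle(order)                          # then cut into k blocks
    cuts = sorted(rng.sample(range(1, k_prime), k - 1))
    bounds = [0] + cuts + [k_prime]
    blocks = [order[a:b] for a, b in zip(bounds, bounds[1:])]
    return [sum(p_prime[i] for i in block) for block in blocks]  # Step 4


rng = random.Random(2)
p = draw_from_d_prime(3, rng)
```

When $k'$ happens to equal $k$, every block contains a single bin and the draw reduces to a draw from the base sampler, mirroring the containment of Eq. 29.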
Table 1. RMS violations of coarse-graining consistency for entropy (Condition 2) for the Wolpert & Wolf (WW), Nemenman, Shafee & Bialek (NSB), and bootstrap estimators. The bootstrap estimator leads to a factor of ten improvement in coarse-graining consistency; as the amount of data increases, the bootstrap approaches full consistency faster. The average entropy of the three-state distributions is approximately 1.2 bits. These results are for the D' prior of Sec. 4.1.

Sampling    WW            NSB           Bootstrap
            (RMS bits)    (RMS bits)    (RMS bits)
1x          0.1402        0.0878        0.0069
2x          0.1112        0.0649        0.0039
4x          0.0769        0.0389        0.0017
8x          0.0485        0.0207        0.0009
16x         0.0282        0.0106        0.0006

Inhomogeneous Dirichlet mixtures of the $D'$ form have not been a focus of study in the context of Bayesian inference, and much work remains to be done. Step 1 clearly admits different parameterizations, for example, as do modifications of the partitioning algorithm for Step 3. We consider the question of extending Bayesian NSB methods to these inhomogeneous mixtures an interesting problem for future work. In this paper, our goal is to determine how well the Wolpert and Wolf and NSB estimators compare against non-Bayesian bootstrap methods. In the following section, we answer this question.

4.2. Coarse-Graining Consistency

The coarse-graining consistency relationship for entropy, Condition 2, and the associated coarse-graining relationship for mutual information, Eq. 27, are central to basic information-theoretic results. These include the chain rules for entropy and mutual information (Theorems 2.5.1 and 2.5.2 of Ref. [8]), which means that the standard Venn diagrams that dictate the relationships between entropies (joint and marginal) and mutual information hold.

The approximate satisfaction of Condition 2' allows us to recover, approximately, nearly all of the structure of Information Theory in the finite-data limit. For example, since a conditional entropy, $H(X|Y)$, can be turned into a difference of entropies, $H(X, Y) - H(Y)$, and each term can be consistently estimated by $H_{\mathrm{corr}}$, our method allows us to estimate any information-theoretic formula that analytically decomposes into the sum of entropies and conditional entropies, and bias-correct if so desired.
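As an illustration of this decomposition, the sketch below estimates a conditional entropy as a difference of two bootstrap-corrected entropies. It assumes that the corrected entropy takes the same form as Eq. 22, that is, twice the naive estimate minus the average naive estimate over resamples; the function names, the number of resamples and the fixed seed are hypothetical choices made for the example.

```python
import random
from math import log2


def naive_entropy(counts):
    """Plug-in entropy (bits) from a list of bin counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def corrected_entropy(counts, resamples=1000, rng=None):
    """Assumed form of H_corr: 2 H(n) - <H(n*)>, by analogy with Eq. 22."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    n = sum(counts)
    bins = range(len(counts))
    acc = 0.0
    for _ in range(resamples):
        star = [0] * len(counts)
        for b in rng.choices(bins, weights=counts, k=n):
            star[b] += 1
        acc += naive_entropy(star)
    return 2.0 * naive_entropy(counts) - acc / resamples


def corrected_conditional_entropy(joint_counts, resamples=1000):
    """H(X|Y) = H(X, Y) - H(Y), with each term bootstrap-corrected.

    joint_counts: rows index X, columns index Y.
    """
    flat = [c for row in joint_counts for c in row]
    y_counts = [sum(col) for col in zip(*joint_counts)]
    return (corrected_entropy(flat, resamples)
            - corrected_entropy(y_counts, resamples))


joint_counts = [[30, 5], [5, 30]]
h_x_given_y = corrected_conditional_entropy(joint_counts)
```

The same pattern applies to any quantity that decomposes into sums and differences of entropies.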

For these reasons, we study how well the bootstrap preserves Condition 2. Because the satisfaction of Eq. 27 is directly related to the deterministic version of the data-processing inequality, we also test Eq. 27 directly. Our results follow directly from sampling distributions $p$ from $D'$, and then characterizing the performance of the different methods in estimating (known) properties of $p$ from finite samples drawn from $p$. We present these results in terms of the sampling factor, defined as the number of observations divided by the number of system states.
Table 2. RMS violations of coarse-graining consistency for mutual information (Eq. 27) for the Wolpert & Wolf (WW) estimator, Nemenman, Shafee & Bialek (NSB) for Mutual Information, and the bootstrap. The bootstrap estimator again leads to a factor of ten improvement in coarse-graining consistency; as the amount of data increases, the bootstrap approaches full consistency faster. The average mutual information of the 2 x 3 distribution is approximately 0.25 bits.

Sampling    WW            NSB-MI        Bootstrap
            (RMS bits)    (RMS bits)    (RMS bits)
1x          0.0277        0.0473        0.0036
2x          0.0163        0.0379        0.0015
4x          0.0094        0.0241        0.0007
8x          0.0052        0.0139        0.0004
16x         0.0029        0.0075        0.0003

Table 1 shows that the bootstrap estimator provides a large gain in RMS consistency compared to two commonly-used estimation methods. Table 2 considers the mutual information consistency relation, Eq. 27. In a similar fashion to the case of Table 1, we find that the bootstrap has much improved performance compared to both WW and NSB.

The relatively poor performance of these two estimators is due in part to the homogeneous nature of the $D_1$ and $D_{\mathrm{NSB}}$ priors, which place equal weight on all known bins. The true system (equal weight in three bins, lopsided weights in the coarse-grained version) is not contained within either space of priors. In both cases, we consider coarse-grainings of distributions drawn from $D'$; however, our results are largely insensitive to whether we use $D_1$ or $D_{\mathrm{NSB}}$ instead. In particular, the relative performance of the bootstrap and the Bayesian estimators is unchanged. The results of Tables 1 and 2 are a main technical result of this paper.

The entropy consistency relationship of Condition 2 and Table 1 is directly relevant to the analyses of Sec. 2 of the predictability of trial outcomes at the Old Bailey. There we are concerned with a range of different possible ways to coarse-grain the underlying trial transcripts, all of which represent strong and contrasting theories of linguistic semantics, and none of which are given to us a priori. Preservation of coarse-graining consistency means that we will not find anomalous gains or losses in predictive power by changing our underlying theories of social cognition and predictive abilities.

Meanwhile, the mutual information consistency relationship, Eq. 27 and Table 2, is particularly useful in analysis of information flows between highly complex underlying state spaces, as was considered in the case of Afghanistan in Sec. 3.§ Simplifying the codebook, or enlarging it to account for additional features of the SIGACTs or other parallel data streams, will, again, lead to consistent shifts in the estimated levels of coordination, signaling and predictability that reflect the structure of the underlying system, and not features of the prior space.

§ Care needs to be taken, however, since a general stochastic function of the underlying process will not, in general, preserve independence in the empirical distribution. It is always possible, for example, to choose an instantiation of a stochastic re-mapping post hoc that magnifies accidental correlations found in the original empirical distribution. We consider this an open question worthy of further investigation.
4.3. Bias Correction and the Reliability of Error Estimates

Having established the utility of the bootstrap in preserving information-theoretic axioms, we conclude this technical section by characterizing the reliability of its bias correction and error estimates. We ask, in other words, how well we estimate the quantities in question, and how well we estimate our uncertainty about them.

Our use of the bootstrap involves a bias correction, and so we first want to know how well the correction works and how that correction compares to other methods in the literature. For a particular $p$, bias is defined as

$$B_H(n, p) = \langle H_{\mathrm{est}}(n) \rangle_{p(n|p)} - H(p), \tag{30}$$

for the case of entropy, and similarly for mutual information and Jensen-Shannon divergence. Informally, $B_H(n, p)$ asks what the average difference is between an estimate of $H$, from a sample of $n$ observations drawn from $p$, and the true value, $H(p)$. In many cases, unbiased estimators are possible; as noted in Sec. 1, however, estimates of information-theoretic quantities are necessarily biased and the real question is the extent to which this bias is reduced by appropriate choice of estimator.

The bias will vary depending on the particular $p$ chosen; for simplicity, we consider the average bias for probabilities that are themselves drawn from a distribution; explicitly, we consider

$$B_H(n, D) = \langle B_H(n, p) \rangle_{P(p|D)}. \tag{31}$$

The use of a Bayesian estimator with prior $D$ guarantees that $B_H(n, D)$ will be zero (note that this does not violate the results of Ref. [2]: bias for any particular $p$ may be non-zero); as discussed above, since we draw from an inhomogeneous prior, $D'$, that is strictly larger than either of the priors used in our Bayesian estimators, this study also allows us to characterize the performance of NSB and WW when predicting out of class.

We also want to know how trustworthy our error estimates are. One useful way to quantify this is to ask how often the true value of the quantity in question lies within the $1\sigma$ (one standard deviation, or 68.2%) and $2\sigma$ (two standard deviations, or 95.4%) ranges.

Our results on bias and error reliability are shown for two test cases: the estimation of the entropy of a 16-state system (Fig. 3), and the estimation of mutual information for a 4 x 4 joint probability (Fig. 4). Bias correction often comes at a cost, since the relationship between the true bias and its estimated value is itself noisy. We find, however, that the RMS error of the bootstrap estimator is comparable to our best Bayesian estimator, NSB; this is shown in Tables 3 and 4.

Figure 3. Left panel: bias for the 16-state entropy estimation case, under prior $D'$. Dotted line: naive estimator; Dashed line: Wolpert and Wolf estimator; Solid line: Bootstrap estimator. We find the NSB estimator has the nice property of being (within error) unbiased on average over $D'$ as well as $D_{\mathrm{NSB}}$. Right panel: one-sigma (solid line) and two-sigma (dashed line) error bar reliability; as the sampling factor increases, both rapidly approach their asymptotic values (thin horizontal lines).

Figure 4. Left panel: bias for the estimation of mutual information on a 4 x 4 joint probability distribution, under prior $D'$. Dotted line: naive estimator; Dashed line, ☆-symbol: Wolpert and Wolf estimator; Dashed line, □-symbol: NSB estimator. Solid line: Bootstrap estimator. Right panel: one-sigma (solid line) and two-sigma (dashed line) error bar reliability; as the sampling factor increases, both rapidly approach their asymptotic values (thin horizontal lines).

4.4. Summary

Our results in this section provide strong support for the use of the bootstrap in the parameter ranges of relevance to many estimation problems. While the NSB estimator is unbiased over the $D'$ space (within error) for the entropy estimation case, it does violate the coarse-graining consistency relationship much more strongly. Meanwhile, the bootstrap is comparable to the NSB in bias and RMS error when used for estimation of more sophisticated quantities such as mutual information.

As an example of the use of the characterizations of this section, Fig. 4 allows us to read off, directly, useful information necessary for the evaluation of the claims of Sec. 3 for asymmetries of information flow in Afghanistan. For the 4 x 4 mutual information estimation problem, we have at least 305 days worth of observations (for the edges of the ±60 day range). This amounts to a 19x oversampling, putting us on the right-hand edge of each panel in Fig. 4. Our bias is well below the overall signal, while our estimation of the $1\sigma$ error band is seen to be reliable.
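The bias of Eq. 30 can be approximated by simulation for any estimator and any fixed $p$. The following minimal sketch does this for the naive (plug-in) entropy estimator on a uniform 16-state distribution at sampling factors of 1x and 16x; the choice of distribution, the number of trials, the seed and all names are assumptions of the example rather than the setup used for Fig. 3.

```python
import random
from math import log2


def entropy(p):
    """Entropy (bits) of a normalized distribution."""
    return -sum(x * log2(x) for x in p if x > 0)


def naive_entropy_from_sample(p, n, rng):
    """Draw n observations from p and return the plug-in entropy estimate."""
    counts = [0] * len(p)
    for b in rng.choices(range(len(p)), weights=p, k=n):
        counts[b] += 1
    return entropy([c / n for c in counts])


def bias(p, n, trials=500, rng=None):
    """Monte Carlo approximation to B_H(n, p) for the naive estimator."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    mean = sum(naive_entropy_from_sample(p, n, rng)
               for _ in range(trials)) / trials
    return mean - entropy(p)


p = [1 / 16] * 16
bias_1x = bias(p, 16)     # sampling factor 1x: 16 observations, 16 states
bias_16x = bias(p, 256)   # sampling factor 16x
```

The naive estimator underestimates the entropy in both cases, and the size of the bias shrinks as the sampling factor grows. In the same units, the 305 observations available for the 16 states of the 4 x 4 problem give a sampling factor of 305/16, or about 19.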

5. Conclusions

We have presented two case studies of the use of information theory for the scientific study of collective phenomena associated with cognitively complex social systems. Scientifically compelling accounts of the role of information (such as those that involve reference to optimal prediction or signaling) often rely on some of the central axioms of information theory, including coarse-graining consistency, error-rate bounds and the data processing inequality. In studying the nature of our (current) tool of choice, then, we have characterized the (approximate) preservation of these axioms in the use of these tools in ranges likely to be relevant to current data.

Information theory originated in the need to describe, and place limits on, the ability of engineered systems to communicate and process signals, to infer properties of the outside world, and to tolerate risk and uncertain environments. The extension of information-theoretic concepts from engineered systems and inferential tasks to the biological and social sciences expands the domain of the theory and places the measurement of its quantities at center stage.

In many cases, the role of human, animal, or otherwise evolved reason in a natural system means that information theory is just as relevant there as it is for the study of engineered systems. Improvements in our understanding of both biological and social systems depend in large part upon increasing our understanding of how they encode and process information. Many systems devote significant amounts of constrained resources to precisely these tasks: the representation of aspects of the environment, and transformations on those representations, as opposed to direct intervention in the environment itself. The design principles for evolved systems may be very different, but the underlying laws are the same. Just as the study of an organism's structural morphology will make reference to engineering concepts such as efficiency, stability and dissipation, so will accounts of how individuals behave in ambiguously cooperative environments make reference to information-theoretic concepts such as bounds on optimal predictions.

Less obvious, but no less central, is the role of coarse-graining in the construction of scientific accounts of collective phenomena.

In the physical sciences, coarse-graining is a fundamental part of the construction of theories. In condensed matter and quantum field theory, the notion of a renormalization group is based on spatial proximity: things that are physically near each other can be grouped together, and the theory describing these coarse-grained groups can be related in a systematic fashion to the theory corresponding to the finer-grained description.

In the case of biological and social systems, functional and computational principles dominate over physical proximity, and the parallel construction is in its infancy [25]. In the interim, we often find it necessary to conduct analyses of such systems using informally derived coarse-grainings dictated by a combination of domain-specific intuition (which properties should be grouped together), numerical or analytic tractability (often, how much grouping is needed for reliable statistics, rejection of null hypotheses, and so forth), and the richness and accuracy of the underlying data. Indeed, in cognitive systems ranging from the neurobiological to the social, the emergence of this coarse-graining is itself a pressing scientific question [26-28].

For both these reasons (the desire to study the role of reasoning in nature, and our lack of knowledge about the right way to carve nature at its joints*), it is useful not only to estimate information-theoretic quantities, but also to derive functions of the data that obey the theory's underlying axioms. The technical characterization of the bootstrap in Sec. 4 provides strong support for its use in place of other estimators when these axioms become important to the reasoning one wants to do. Future work in this field will almost certainly provide better tools (including, one hopes, a fully Bayesian method for preserving the axioms exactly) and new insights into the role that reason and inference play in the natural world.

6. Appendix: The Trial of John Long, as reported 18th September, 1820

John Long was indicted for that he, on the 28th of August, upon George North, feloniously, wilfully, and maliciously did make an assault, and with a sharp instrument did strike and cut him, in and upon his face, with intent to kill and murder him, and do him some grievous bodily harm.

Ann Hickman. I live in Gardner's lane, King street, Westminster. The prisoner and his wife lived in the same house. On the 28th of August I was looking out of window, and saw the prisoner and his wife in the yard, with North, consulting about parting; his wife had cohabited with North, and wanted her clothes to go away with him; the prisoner said if she came up stairs she should have them; then all three came into the passage. I heard words at the foot of the stairs, went down, and saw North against the wall; he said, "I am done for." I saw his hand drop against the wall, it was all bloody. The prisoner was about half a yard from him. I did not see him do anything, and never heard him threaten North. I saw a knife in his hand, and said "Long take care what you are at, and give me the knife." He shut the knife up, and gave it to me; it was rather bloody. North had three cuts in his face; I saw nothing more. He was taken to the doctor's. A quantity of blood laid at the door. Whatever happened was done before I came down.

* The "other principle" of Socrates: that of "dividing things again by classes, where the natural joints are, and not trying to break any part, after the manner of a bad carver" [29].

Henry Betts. I am a constable. About eight o'clock on the night this happened, I was sent for, and knocked at the prisoner's room door, and told him to open it. I found it open, he was there in bed with his wife, she was in liquor. I took him to the watch-house. Four or five days after I saw North, his wounds were dressed. He had one cut from his ear down towards his mouth, his lip was cut, and he had a stab in his cheek.

Prisoner's Defence. My wife had deserted me, and gone with North; I met them together, and North said if I touched her he would break every bone in my body. She followed me home, I was going to take her up stairs, and he seized me by the throat.

George North, being called, did not appear.

[Total length: 398 words. Verdict: not guilty.]

7. Appendix: SIGACT for IED Explosion, April 3rd, 2005, Kandahar Province

Title

IED ANP CIV Other 1 CIV KIA 1 ANP KIA

Text

CJSOTF REPORTS IED STIKE IVO SPIN BULDAK. THE FOLLOWING SALT REPORT WAS SENT: S: 1X IED, A: IED STRKE, L: 42RTV 51 120 32190, T: 0315Z. REMARKS: FR SOF SENDING OUT RECON TO INVESTIGATE. AT 0722Z TG ARES CURRENTLY ON SITE CONDUCTING INVESTIGATION. THERE WAS ONE EXPLOSION FROM A MINE HIDDEN IN A BEVERAGE VENDORS CART. APPEARS TO HAVE BEEN COMMAND DETONATED. 1X DISTRICT POLICE OFFICER IS KIA. NO WIA REPORTED.

Additional Data

{:reportkey=>"DCEAC77F-A84D-45F3-88B3-33B9B5A95B20",
 :type=>:explosivehazard, :category=>:iedexplosion,
 :trackingnumber=>:2007-033-005423-0737, :region=>:rcsouth,
 :attackon=>:enemy, :complexattack=>false,
 :reportingunit=>:other, :unitname=>:other, :typeofunit=>:noneselected,
 :friendlywia=>0, :friendlykia=>0, :hostnationwia=>0, :hostnationkia=>1,
 :civilianwia=>0, :civiliankia=>1, :enemywia=>0, :enemykia=>0,
 :enemydetained=>0, :mgrs=>"42RTV5110332180", :latitude=>30.99594061,
 :longitude=>66.39333344, :originatorgroup=>:unknown, :updatedbygroup=>:unknown,
 :ccir=>:"", :sigact=>:"", :affiliation=>:enemy, :dcolor=>:red,
 :classification=>:secret, :start=>2005-04-03 03:15:00 UTC,
 :province=>:kandahar, :district=>:spinboldak, :nearestgeocode=>"AF241131834",
 :nearestname=>"Spin Boldak"}

Source and Post-processing

Original source http://wikileaks.org/afg/event/2005/04/AFG20050403n68.html; post-processing for initiative (via methods of Ref. [12]), geocode (GPS cross-reference to data provided by Afghanistan Information Management Service), and additional filtering by collaboration [17].

Table 3. RMS errors for estimation of the entropy information of a 16-state system, for the Wolpert & Wolf (WW) estimator, Nemenman, Shafee & Bialek (NSB) for Mutual Information, and the bootstrap. The bootstrap estimator has RMS errors comparable to the NSB.

Sampling    WW            NSB           Bootstrap
            (RMS bits)    (RMS bits)    (RMS bits)
1x          0.7945        0.3352        0.3825
2x          0.6256        0.2242        0.2368
4x          0.4515        0.1497        0.1540
8x          0.2930        0.1035        0.1050
16x         0.1801        0.0728        0.0733



Table 4. RMS errors for estimation of the mutual information of a 4 x 4 joint probability, for the Wolpert & Wolf (WW) estimator, Nemenman, Shafee & Bialek (NSB) for Mutual Information, and the bootstrap. The bootstrap estimator has RMS errors comparable to the NSB.

Sampling    WW            NSB           Bootstrap
            (RMS bits)    (RMS bits)    (RMS bits)
1x          0.2798        0.1902        0.2967
2x          0.2260        0.1446        0.1780
4x          0.1688        0.1036        0.1154
8x          0.1200        0.0731        0.0772
16x         0.0795        0.0519        0.0535

Figure 5. A pedagogical example of the bootstrap in action.

[Figure: two panels, each with horizontal axis "Entropy (bits)" (tick labels up to 4.0) and a vertical axis running from 0.00 to 0.15. Annotations: true entropy of distribution; naive entropy of sample from distribution; distribution of (naive) entropies for samples of the sample; how much the sample underestimates the truth; how much (on average) the resamplings underestimate the sample; bootstrap-estimated distribution for the true entropy.]
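The procedure that Figure 5 illustrates can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation characterized in Sec. 4: it assumes the standard bias-corrected ("basic") bootstrap, in which the amount by which the resamplings underestimate the sample is taken as a proxy for the amount by which the sample underestimates the truth. The names `naive_entropy` and `bootstrap_entropy`, the number of resamplings, and the uniform 16-state test distribution sampled at 1x are illustrative choices.

```python
import numpy as np


def naive_entropy(counts):
    """Plug-in (naive) entropy, in bits, of a vector of observed counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # empty bins contribute nothing
    return float(-np.sum(p * np.log2(p)))


def bootstrap_entropy(counts, n_boot=1000, rng=None):
    """Return the naive sample entropy and a bootstrap distribution for the truth.

    Assumption: the basic bootstrap, where each resampled entropy h_b is
    reflected about the sample entropy h_s to give the estimate 2*h_s - h_b.
    """
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts)
    n = int(counts.sum())
    h_sample = naive_entropy(counts)
    # Resampling the sample with replacement is equivalent to a multinomial
    # draw of n observations from the empirical frequencies.
    resampled = rng.multinomial(n, counts / n, size=n_boot)
    h_boot = np.array([naive_entropy(c) for c in resampled])
    estimates = 2.0 * h_sample - h_boot
    return h_sample, estimates


# Hypothetical example: 16 observations of a uniform 16-state system,
# whose true entropy is 4 bits.
rng = np.random.default_rng(0)
sample_counts = rng.multinomial(16, np.full(16, 1 / 16))
h_sample, estimates = bootstrap_entropy(sample_counts, n_boot=1000, rng=rng)

print("naive entropy of sample (bits):  ", h_sample)
print("bootstrap point estimate (bits): ", estimates.mean())
print("bootstrap one-sigma width (bits):", estimates.std())
```

The mean of `estimates` is the bias-corrected point estimate, and its spread gives an error bar. In this heavily undersampled example the correction moves the estimate upward from the naive value, in the direction of the true entropy.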



References

1. Shannon, C. A mathematical theory of communication. Bell System Technical Journal 1948, 27, 379-423, 623-656.
2. Paninski, L. Estimation of Entropy and Mutual Information. Neural Computation 2003, 15, 1191-1253.
3. Klingenstein, S.; Hitchcock, T.; DeDeo, S. Institution Formation, Semantic Phenomena, and Three Hundred Years of the British Criminal Court System 2012-. Research Collaboration.
4. Lin, J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 1991, 37, 145-151.
5. Nielsen, F. A family of statistical symmetric divergences based on Jensen's inequality. ArXiv e-prints 2010, cs.CV. 1009.4004v2.
6. Hellman, M.; Raviv, J. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory 1970, 16, 368-372.
7. Jaynes, E.; Bretthorst, G. Probability Theory: The Logic of Science; Cambridge University Press, 2003.
8. Cover, T.; Thomas, J. Elements of Information Theory; Wiley, 2006.
9. Chen, C.H. On information and distance measures, error bounds, and feature selection. Information Sciences 1976, 10, 159-173.
10. Ito, T. Approximate error bounds in pattern recognition. Machine Intelligence 1972, 7, 369-376.
11. Hashlamoun, W.A.; Varshney, P.K.; Samarasooriya, V.N.S. A tight upper bound on the Bayesian probability of error. IEEE Transactions on Information Theory 1994, 16, 220-224.



Version February 6, 2013 submitted to Entropy 






12. O'Loughlin, J.; Witmer, F.; Linke, A.; Thorwardson, N. Peering into the Fog of War: The Geography of the WikiLeaks Afghanistan War Logs 2004-2009. Eurasian Geography and Economics 2010, 51, 472-95.
13. Zammit-Mangion, A.; Dewar, M.; Kadirkamanathan, V.; Sanguinetti, G. Point process modelling of the Afghan War Diary. Proceedings of the National Academy of Sciences 2012, 109, 12414-12419.
14. Sanín, F.G.; Giustozzi, A. Networks and Armies: Structuring Rebellion in Colombia and Afghanistan. Studies in Conflict & Terrorism 2010, 33, 836-853.
15. Gutierrez Sanin, F. Telling the Difference: Guerrillas and Paramilitaries in the Colombian War. Politics & Society 2008, 36, 3-34.
16. Green, A.H. Repertoires of Violence Against Non-combatants: The Role of Armed Group Institutions and Ideologies. PhD thesis, Yale University, 2011.
17. Hawkins, R.; DeDeo, S. The Emergence of Insurgency in Afghanistan: an Information Theoretic Analysis 2012-. Research Collaboration.
18. Hutter, M. Distribution of Mutual Information, eprint arXiv:cs/0112019 2001. http://arxiv.org/abs/cs.AI/0112019; Advances in Neural Information Processing Systems 14 (NIPS-2001) pg. 399-406.
19. Zaffalon, M.; Hutter, M. Robust feature selection by mutual information distributions. Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence 2002, pp. 577-584.
20. Williams, P.L.; Beer, R.D. Generalized Measures of Information Transfer. eprint arXiv:1102.1507 [physics.data-an] 2011. http://arxiv.org/abs/1102.1507.
21. Schreiber, T. Measuring Information Transfer. Physical Review Letters 2000, 85, 461-464.
22. Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E 2004, 69, 56111.
23. Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Physical Review E 1995, 52, 6841.
24. Nijenhuis, A.; Wilf, H.S. Combinatorial algorithms for computers and calculators; Computer Science and Applied Mathematics, Academic Press, 1978. 2nd ed.
25. DeDeo, S. Effective theories for circuits and automata. Chaos 2011, 21, 7106.
26. Olshausen, B.A.; Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 1997, 37, 3311-3325.
27. Olshausen, B.A.; Field, D.J. Sparse coding of sensory inputs. Current opinion in neurobiology 2004, 14, 481-487.
28. Daniels, B.C.; Krakauer, D.C.; Flack, J.C. Sparse code of conflict in a primate society. Proceedings of the National Academy of Sciences 2012, 109, 14259-14264.
29. Plato. Phaedrus; Harvard University Press, 1925. Plato in Twelve Volumes, Vol. 9. Translated by Harold N. Fowler; see http://www.perseus.tufts.edu/hopper/text?doc=Plat.+Phaedrus+265e.






(c) February 6, 2013 by the authors; submitted to Entropy for possible open access publication under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.



